93 research outputs found
Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval
We address the problem of image-to-video retrieval. Given a query image, the aim is to identify the frame or scene within a collection of videos that best matches the visual input. Matching images to videos is an asymmetric task that requires specific features capable of capturing the visual information in images while, at the same time, compacting the temporal correlation in videos. Methods proposed so far are based on the temporal aggregation of hand-crafted features. In this work, we propose a deep learning architecture for learning specific asymmetric spatio-temporal embeddings for image-to-video retrieval. Our method learns non-linear projections from training data for both images and videos and projects their visual content into a common latent space, where they can be easily compared with a standard similarity function. Experiments conducted here show that our proposed asymmetric spatio-temporal embeddings outperform the state of the art on standard image-to-video retrieval datasets.
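The core idea above, separate (asymmetric) projections that map an image feature and a temporally aggregated video feature into one latent space where a standard similarity applies, can be sketched as follows. This is a minimal illustration, not the paper's architecture: the projection weights are random stand-ins for learned parameters, the dimensions are assumed, and mean pooling is one simple choice of temporal aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_VID, D_EMB = 512, 512, 128  # assumed feature/embedding sizes

# Stand-ins for learned projection weights (random here for illustration).
W_img = rng.standard_normal((D_IMG, D_EMB)) / np.sqrt(D_IMG)
W_vid = rng.standard_normal((D_VID, D_EMB)) / np.sqrt(D_VID)

def embed_image(x):
    """Non-linear projection of an image feature into the common space."""
    z = np.maximum(x @ W_img, 0.0)   # ReLU non-linearity
    return z / np.linalg.norm(z)

def embed_video(frames):
    """Aggregate per-frame features over time, then project asymmetrically."""
    pooled = frames.mean(axis=0)     # temporal aggregation (mean pooling)
    z = np.maximum(pooled @ W_vid, 0.0)
    return z / np.linalg.norm(z)

query = embed_image(rng.standard_normal(D_IMG))
clip = embed_video(rng.standard_normal((30, D_VID)))  # a 30-frame segment
score = float(query @ clip)          # cosine similarity in the common space
```

Because the video side is compacted to a single vector per segment, comparison against an image query reduces to one dot product, which is what makes the approach memory-efficient at retrieval time.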
Learning Non-Metric Visual Similarity for Image Retrieval
Measuring visual similarity between two or more instances within a data distribution is a fundamental task in image retrieval. Theoretically, non-metric distances are able to generate a more complex and accurate similarity model than metric distances, provided that the non-linear data distribution is precisely captured by the system. In this work, we explore neural network models for learning a non-metric similarity function for instance search. We argue that non-metric similarity functions based on neural networks can build a better model of human visual perception than standard metric distances. As our proposed similarity function is differentiable, we explore a real end-to-end trainable approach for image retrieval, i.e. we learn the weights from the input image pixels to the final similarity score. Experimental evaluation shows that non-metric similarity networks are able to learn visual similarities between images and improve performance on top of state-of-the-art image representations, boosting results on standard image retrieval datasets with respect to standard metric distances.
Spatial and temporal representations for multi-modal visual retrieval
This dissertation studies the problem of finding relevant content within a visual collection according to a specific query by addressing three key modalities: symmetric visual retrieval, asymmetric visual retrieval and cross-modal retrieval, depending on the kind of data to be processed. In symmetric visual retrieval, the query object and the elements in the collection are from the same kind of visual data, i.e. images or videos. Inspired by the human visual perception system, we propose new techniques to estimate visual similarity in image-to-image retrieval datasets based on non-metric functions, improving image retrieval performance on top of state-of-the-art methods. On the other hand, asymmetric visual retrieval is the problem in which queries and elements in the dataset are from different types of visual data. We propose methods to aggregate the temporal information of video segments so that imagevideo comparisons can be computed using similarity functions. When compared in image-to-video retrieval datasets, our algorithms drastically reduce memory storage while maintaining high accuracy rates. Finally, we introduce new solutions for cross-modal retrieval, which is the task in which either the queries or the elements in the collection are non-visual objects. In particular, we study text-image retrieval in the domain of art by introducing new models for semantic art understanding, obtaining results close to human performance. Overall, this thesis advances the state-of-the-art in visual retrieval by presenting novel solutions for some of the key tasks in the field. The contributions derived from this work have potential direct applications in the era of big data, as visual datasets are growing exponentially every day and new techniques for storing, accessing and managing large-scale visual collections are required
Context-Aware Embeddings for Automatic Art Analysis
Automatic art analysis aims to classify and retrieve artistic representations
from a collection of images by using computer vision and machine learning
techniques. In this work, we propose to enhance visual representations from
neural networks with contextual artistic information. Whereas visual
representations are able to capture information about the content and the style
of an artwork, our proposed context-aware embeddings additionally encode
relationships between different artistic attributes, such as author, school, or
historical period. We design two different approaches for using context in
automatic art analysis. In the first one, contextual data is obtained through a
multi-task learning model, in which several attributes are trained together to
find visual relationships between elements. In the second approach, context is
obtained through an art-specific knowledge graph, which encodes relationships
between artistic attributes. An exhaustive evaluation of both of our models in
several art analysis problems, such as author identification, type
classification, or cross-modal retrieval, shows that performance is improved by
up to 7.3% in art classification and 37.24% in retrieval when context-aware
embeddings are used.
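One way to picture a context-aware embedding is as a visual representation fused with vectors encoding artistic attributes. The sketch below is only illustrative: the attribute codes are random stand-ins (in the paper they come from multi-task learning or a knowledge graph), the attribute names are hypothetical examples, and simple concatenation is one of several possible fusion choices.

```python
import numpy as np

rng = np.random.default_rng(2)
D_VIS, D_CTX = 128, 32  # assumed dimensions

# Hypothetical context codes for artistic attributes (random stand-ins).
attribute_codes = {
    ("author", "Vermeer"): rng.standard_normal(D_CTX),
    ("school", "Dutch"): rng.standard_normal(D_CTX),
    ("period", "17th century"): rng.standard_normal(D_CTX),
}

def context_aware_embedding(visual, attributes):
    """Fuse a visual representation with averaged attribute context codes."""
    ctx = np.mean([attribute_codes[a] for a in attributes], axis=0)
    z = np.concatenate([visual, ctx])
    return z / np.linalg.norm(z)

e = context_aware_embedding(
    rng.standard_normal(D_VIS),
    [("author", "Vermeer"), ("school", "Dutch")],
)
```

The point of the fusion is that two artworks sharing an author or school end up closer in the embedding space than their visual content alone would place them, which is what drives the classification and retrieval gains reported above.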
KnowIT VQA: Answering Knowledge-Based Questions about Videos
We propose a novel video understanding task by fusing knowledge-based and
video question answering. First, we introduce KnowIT VQA, a video dataset with
24,282 human-generated question-answer pairs about a popular sitcom. The
dataset combines visual, textual and temporal coherence reasoning together with
knowledge-based questions, which require experience acquired from watching the
series to be answered. Second, we propose a video understanding
model by combining the visual and textual video content with specific knowledge
about the show. Our main findings are: (i) the incorporation of knowledge
produces outstanding improvements for VQA in video, and (ii) the performance on
KnowIT VQA still lags well behind human accuracy, indicating its usefulness for
studying current video modelling limitations.
The Semantic Typology of Visually Grounded Paraphrases
Visually grounded paraphrases (VGPs) are different phrasal expressions describing the same visual concept in an image. Previous studies treat VGP identification as a binary classification task, which ignores the various phenomena behind VGPs (i.e., different linguistic interpretations of the same visual concept), such as linguistic paraphrases and VGPs from different aspects. In this paper, we propose a semantic typology for VGPs, aiming to elucidate the VGP phenomena and deepen our understanding of how human beings interpret vision with language. We construct a large VGP dataset that annotates the class to which each VGP pair belongs according to our typology. In addition, we present a classification model that fuses language and visual features for VGP classification on our dataset. Experiments indicate that joint language and vision representation learning is important for VGP classification. We further demonstrate that our VGP typology can boost the performance of visually grounded textual entailment.
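Fusing language and visual features for typology classification can be sketched as scoring each class from the concatenation of two phrase embeddings and an image-region feature. All sizes, the class count, and the random weights below are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(3)
D_TXT, D_IMG, N_CLASSES = 50, 64, 5  # assumed sizes; class count is illustrative

# Stand-in for a trained linear classifier over the fused features.
W = rng.standard_normal((2 * D_TXT + D_IMG, N_CLASSES)) / np.sqrt(2 * D_TXT + D_IMG)

def classify_vgp(phrase_a, phrase_b, region):
    """Score typology classes from fused language + vision features."""
    fused = np.concatenate([phrase_a, phrase_b, region])
    logits = fused @ W
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()              # softmax over VGP classes
    return int(np.argmax(probs)), probs

cls, probs = classify_vgp(
    rng.standard_normal(D_TXT),
    rng.standard_normal(D_TXT),
    rng.standard_normal(D_IMG),
)
```

Moving from a binary paraphrase/not-paraphrase decision to a multi-class output like this is what lets the model distinguish the different VGP phenomena the typology names.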
- …